Abstract

Severe background clutter is challenging in many computer
vision tasks, including large-scale image retrieval.
Global descriptors, which are popular due to their memory
and search efficiency, are especially prone to corruption by
such clutter. Eliminating the impact of clutter on the
image descriptor increases the chance of retrieving relevant
images and, in the case of query expansion, prevents topic
drift caused by retrieving the clutter itself. In this work, we
propose a novel salient region detection method. It captures,
in an unsupervised manner, patterns that are both
discriminative and common in the dataset. Saliency is based on a
centrality measure of a nearest neighbor graph constructed
from regional CNN representations of dataset images. The
descriptors derived from the salient regions improve
particular object retrieval, most noticeably in large collections
containing small objects.


1. Introduction 

Particular object retrieval becomes very challenging
when the object of interest covers only a small part of the
image. In this case, the amount of relevant information is
significantly reduced. Large objects might be partially
occluded, while small objects are on a background that covers
most of the image. A combination of both, occlusion and
cluttered background, is not rare either. These conditions
naturally arise from image acquisition and make naive
approaches fail, including global template matching or
semi-robust template matching [2( ].

Ideally, image descriptors should be extracted only from 
the relevant part of the image, suppressing the irrelevant 
clutter and occlusions. In this paper, we attempt to
determine the regions containing the relevant information, as
shown in Figure 1, in a fully unsupervised manner.

Methods based on robust matching of hand-crafted local
features are naturally insensitive to occlusion and background
clutter. The locality of the features allows matching
small parts of images in regions containing the object of
interest, while incorrect matches are typically removed by a
robust geometric consistency check [29]. Methods based
on efficient matching of vector-quantized local-feature
descriptors were introduced in the context of image retrieval by
Sivic and Zisserman [36].

Figure 1. The saliency map (right) computed for an input image
(left) based on common-structure analysis on the Instre dataset.
Background clutter and objects not relevant for this dataset are
automatically removed. The image is represented only by the region
detected on the saliency map.

Retrieval methods based on descriptors extracted by
convolutional neural networks (CNNs) have become popular
because they combine good precision and recall, efficiency
of the search, and reasonable memory footprint [5, 31].
Deep neural networks are capable of learning, to some
extent, what information in the image is relevant, which
results in good performance even with global
descriptors [40, 4, 18]. However, if the signal-to-noise ratio is
low, e.g. the object is relatively small, multiple objects are
present, etc., the global CNN descriptors fail [13, 12].

A class of methods inspired by object detection has
recently emerged. Instead of attempting to match the whole
image to the query, the problem is changed to finding
a rectangular region in the image that best matches the
query [40, 32]. An exhaustive search by sliding window is
intractable for large collections of images, so the exhaustive
enumeration is approximated by similarity evaluation on a
number of pre-selected regions. The regions are either
selected geometrically to cover the whole image at different
scales, as in R-MAC [40], or by considering the content by
object or region proposal methods [32, 37, 9].

Another direction of suppressing irrelevant content is
saliency detection [18, 25]. For each image, a saliency map,
which captures more general region shapes than (a
small set of) rectangles, is first estimated. The contribution
of each pixel (or region) is then proportional to the saliency
of that location.

In this work we introduce a very simple pooling scheme
that inherits the properties of both saliency detection and
region-based pooling and that, like all previous approaches,
is applied to each image in the database independently. In
addition, we investigate the use of the resulting regional
representation for automatic, offline object discovery and
suppression of background clutter, which considers the image
collection as a whole. Unlike previous approaches, we do
this in an unsupervised way. As a consequence, our
representation takes two saliency detection steps into account:
one that acts per image and depends solely on its content,
and another that considers the image collection as a whole
and captures frequently appearing objects.

In both cases, we derive a global representation that
outperforms comparable state-of-the-art methods in retrieving
small objects on standard benchmarks, while the memory
footprint and online cost are only a fraction of those of
more powerful regional representations [31, U ]. Moreover,
we show that our representation benefits significantly from 
query expansion methods. 

Section 2 discusses our contributions against related 
work. Section 3 describes our methodology including our 
pooling scheme in Section 3.3 and our object discovery
approach in Section 3.8. We present experimental results in
Section 4 and draw conclusions in Section 5. 

2. Related work 

Local features and geometric matching offer an attractive
way for retrieval systems to handle occlusions, clutter,
and small objects [36, 29, 14]. Their drawbacks are
high query complexity and large storage cost; an image is
typically represented by several thousand features. Many
methods attempt to decrease the number of indexed features
by removing background clutter while maintaining the
relevant information. The selection procedure is either applied
independently per image or considers an image collection as
a whole. Common examples of the former case are bursty
feature detection [34], symmetry detection [39] or use of
semantic segmentation [1, 27]. The methods of the second
category are scalable enough to jointly process the whole
collection and perform feature selection under the following
assumption: a feature that repeats over multiple instances
of the same object in the dataset is likely to appear in novel
views of the object too. Representative cases are common
object discovery [41, 38], co-occurrence detection [6], or
methods using GPS information [8, 2C].

The work by Turcot and Lowe [4 ] performs pairwise
spatial verification on hand-crafted local features across all
images and only indexes verified features. With an
additional off-line cost, the on-line stage is sped up and the
memory footprint is reduced. However, unique views of
objects are not verified and thus discarded. In this work, we
address a similar selection problem based on a more powerful
CNN-based representation rather than local features.

Recent advances in deep learning [3, 40, 18, 10, 30]
dispense with the large memory footprint by using global
descriptors and cast the problem of instance search as
Euclidean nearest neighbor search. Nevertheless, background
clutter and occlusion are better handled by regional
representation. Regional descriptors significantly increase the
performance when they are indexed independently [31, L ]
but this comes at a prohibitive memory and computational
cost for large-scale scenarios. Region Proposal Networks
(RPN) are applied either off-the-shelf [32] or after
fine-tuning [3 ] for instance search. The RPNs reduce the
number of regions per image only to the order of tens. Our work
focuses on aggregating regional representations, which keeps
the complexity low, but we detect regions around
salient objects and objects that frequently appear in the
dataset. Jimenez et al. [16] construct saliency maps and
perform region detection to construct global image vectors,
as we also do. However, they employ generic object
detectors trained on ImageNet, which makes the method
inapplicable to fine-tuned networks, which provide the best
performance. The Hessian-affine detector is used on CNN
activations to detect repeatable regions [15]. The major
benefit in that work, though, comes from second-order pooling
and higher-dimensional descriptors.

Saliency maps are another way to handle clutter and
occlusions. Once more, there exist examples both of saliency
computed in an unsupervised manner [18, 21] and of saliency
that is learned [25, 17] and applied per image afterwards.
Our approach generates saliency maps in a fully unsupervised
way that capture both salient objects in single images and
repeating objects appearing in a particular image collection.

3. Method 

Like [41], our objective is to remove transient and
non-distinctive objects as in Figure 1 and rather focus on objects
appearing frequently in a dataset. Beginning with the
activation map of a convolutional layer in a CNN, one would
need access to a local representation to automatically
discover such objects. On the other hand, knowing what these
objects are would help form a local representation by
selecting regions depicting them, which appears to be a
chicken-and-egg problem. Without an initial region
selection, we risk “discovering” uninformative but frequently
appearing “stuff”-like patches, for instance sky.



Figure 2. Overview of our offline unsupervised process. On the 
top row, CNN activations of dataset images are used to extract a 
feature saliency map, on which a set of regions is detected. On 
the bottom row, a centrality measure is obtained per region from 
a region k-NN graph. Using this measure, a dense object saliency
map is formed from the original CNN activations and the feature 
saliency. This map is focusing on objects automatically discovered 
in the dataset, with background clutter removed. Finally, another 
set of regions is detected on the object saliency map to extract 
descriptors and represent the dataset for retrieval. 


3.1. Overview 

Fortunately, it is possible to make an initial selection
based on CNN activations alone, without any training and
without bounding box annotations. As described in
Section 3.3, the mechanism is inspired by CroW [18] and
Grad-CAM [33] and generates a feature saliency map. This
initiates our offline analysis illustrated in Figure 2. A small set
of rectangular regions is detected per image from this map
as discussed in Section 3.4. This first round of detection
is applied independently per image and depends only on its
content.

Each region in the dataset is associated with a feature
saliency score and a visual descriptor, pooled from the
activation map of the corresponding image, as discussed in
Section 3.5. It is now possible to compute a centrality score
per region, representing the “significance” of each region in
the dataset. This is based on a region k-NN graph and is
discussed in Sections 3.6 and 3.7.

Now, given a new image, we can infer the “significance”
of every region from its nearest neighbors in the graph,
yielding a dense object saliency map as discussed in
Section 3.8. This is a regression problem and we suggest a
non-parametric k-NN solution. Finally, we detect a small
set of rectangular regions on this saliency map and extract a
global descriptor to represent dataset images for retrieval, as
discussed in Section 3.9. This second detection procedure
takes into account all salient and repeating objects
appearing in the dataset.

The entire process is fully unsupervised and only
assumes off-the-shelf networks trained on a classification or
retrieval task without bounding box annotations.


3.2. Representation 

We represent the activation map of a convolutional layer
as a non-negative 3d tensor $A \in \mathbb{R}^{h \times w \times c}$, where h, w are
the spatial resolution (height, width) and c is the number of
feature channels. The set of valid spatial positions is
$P := [h] \times [w]$ ¹ and the set of all rectangles with vertices in P
is denoted by $\mathcal{R}$. By $A_{pj}$ we represent an element of A at
position $p \in P$ and channel $j \in [c]$. By $A_{\cdot j} \in \mathbb{R}^{h \times w}$ we
denote the 2d feature map of A corresponding to channel
$j \in [c]$. By $A_{p \cdot} \in \mathbb{R}^{c}$ we denote the vector containing all
feature channels at position $p \in P$.


3.3. Feature saliency 


Inspired by cross-dimensional weighting and pooling
(CroW) [18] and class activation mapping (CAM) [45], we
construct a 2d saliency map of an image based on a
convolutional activation of that image alone. Following CroW,
we compute an idf-like weight per channel, $b \in \mathbb{R}^{c}$, with
elements

$$b_j := \log\left(\frac{(a + \epsilon)^\top \mathbf{1}}{a_j + \epsilon}\right) \qquad (1)$$

for $j \in [c]$, where the vector $a \in \mathbb{R}^{c}$ holds the
average number of nonzero elements per channel. We then
compute a weighted sum over channels

$$F := \sum_{j \in [c]} b_j A_{\cdot j}. \qquad (2)$$

Finally, we obtain the 2d feature saliency (FS) map
$F \in \mathbb{R}^{h \times w}$ by normalizing F according to [18]. Contrary to
CroW, we use the feature channel weights when computing
the 2d spatial weights, amplifying channels with sparse
activation. This order of summation is the same as in CAM.
However, we are working with channel weights obtained by
a sparsity property on any convolutional layer, without any
assumption on the network topology. CAM, on the other
hand, assumes global average pooling followed by a fully
connected layer mapping channels to classes and uses the
parameters of this layer to obtain a saliency map per class.
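
As an illustration, the feature saliency computation of this section can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the implementation used in our experiments: the function name `feature_saliency`, the value of `eps`, the reading of a_j as the fraction of spatial positions where channel j is nonzero, and the CroW-style normalization (square root of the ℓ2-normalized map) are all assumptions made for the example.

```python
import numpy as np

def feature_saliency(A, eps=1e-6):
    """Feature saliency map of a non-negative activation tensor A of shape (h, w, c)."""
    h, w, c = A.shape
    # a_j: fraction of positions where channel j is nonzero
    # (assumed reading of "average number of nonzero elements per channel").
    a = (A > 0).reshape(h * w, c).mean(axis=0)
    # idf-like channel weights: sparsely activated channels get larger weights.
    b = np.log((a + eps).sum() / (a + eps))
    # Weighted sum over channels gives a 2d map of shape (h, w).
    F = A @ b
    # Assumed normalization in the spirit of CroW: l2-normalize, then square root.
    norm = np.linalg.norm(F)
    return np.sqrt(F / norm) if norm > 0 else F
```

Applying the channel weights before the spatial sum is what distinguishes this order of operations from CroW, as discussed above.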

3.4. Region detection 

We are given a 2d saliency map S, which can be either
the feature saliency described in Section 3.3 or the object
saliency described in Section 3.8. We use an expanding
Gaussian mixture (EGM) model [2] to detect a number of
salient rectangular regions. This is a variant of
expectation-maximization (EM) that iteratively performs local
averaging (E- and M-steps) interleaved with a selection process
(P-step) similar to non-maximum suppression (NMS). In
doing so, it dynamically estimates the number of regions.

¹ Here, [i] is the set {1, ..., i} for i ∈ N.

[Figure 3 panels: image; i = 0, m = 272; i = 2, m = 29; i = 3, m = 22; i = 5, m = 17; i = 7, m = 11; i = 14, m = 9]

Figure 3. Evolution of regions during EGM iterations on the feature saliency map of an image of Magdalen tower from the Oxford buildings
dataset, shown on the left. Below each image we display the iteration i and the number of regions m.

The original algorithm applies to point sets and isotropic
Gaussian components. Here we extend it to functions,
considering that a saliency map is a function $S : P \to \mathbb{R}$. We
use it to fit a number of components, each modeling a
rectangular region in 2d coordinate space. We also extend it to
a diagonal covariance model, so that a rectangle is modeled
by an axis-aligned ellipse.

In particular, given a 2d saliency map $S \in \mathbb{R}^{h \times w}$, we
represent it as a set of Gaussian functions $s_i : \mathbb{R}^2 \to \mathbb{R}$ with

$$s_i(x) := S_{p_i} \, \mathcal{N}(x \mid p_i, \sigma I_2) \qquad (3)$$

for $i \in [\ell]$, $x \in \mathbb{R}^2$, where $\mathcal{N}$ is the normal density,
$\ell = |P|$ is the number of positions and we represent P as
$\{p_1, \dots, p_\ell\}$. Here, $\sigma$ is a scale parameter that determines
how coarse or fine the region representation will be for the
given saliency map. Similarly, we represent components as
Gaussian functions $q_k : \mathbb{R}^2 \to \mathbb{R}$ with

$$q_k(x) := \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \qquad (4)$$

for $k \in [m]$, $x \in \mathbb{R}^2$, where m is the number of components
and $\pi_k \in \mathbb{R}$, $\mu_k \in \mathbb{R}^2$ and $\Sigma_k \in \mathbb{R}^{2 \times 2}$ are the mixing
coefficient, mean and diagonal covariance matrix respectively
of component k. Means represent region centers, while
the (inverse) eigenvalues of covariance matrices represent
heights and widths. We initialize components as $q_k \gets s_k$
for $k \in [m]$, with $m \gets \ell$. In the expectation (E)-step, we
compute the responsibility

$$\gamma_{ik} \gets \frac{\langle s_i, q_k \rangle}{\sum_{j \in [m]} \langle s_i, q_j \rangle} \qquad (5)$$

of component $k \in [m]$ for sample $i \in [\ell]$, where $\langle f, g \rangle$ is the
$L^2$ inner product of square-integrable functions $f, g : \mathbb{R}^d \to \mathbb{R}$,
computed in closed form for Gaussian functions [2]. In
the maximization (M)-step, we update parameters as

$$\pi_k \gets \frac{\ell_k}{\ell} \qquad (6)$$

$$\mu_k \gets \frac{1}{\ell_k} \sum_{i=1}^{\ell} \gamma_{ik} \, p_i \qquad (7)$$

$$\Sigma_k \gets \frac{1}{\ell_k} \sum_{i=1}^{\ell} \gamma_{ik} \, \operatorname{diag}\left((p_i - \mu_k)^{\circ 2}\right) \qquad (8)$$

where $\ell_k := \sum_{i=1}^{\ell} \gamma_{ik}$ is the effective number of points
assigned to component k and $X^{\circ 2} := X \circ X$ is the Hadamard
product power for a vector or matrix X.
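
To make the E- and M-steps concrete, the following NumPy sketch performs one iteration for diagonal-covariance components. It is a hedged illustration rather than the EGM implementation of [2]: it relies on the standard identity that the L² inner product of two normal densities N(·|a, A) and N(·|b, B) equals N(a|b, A + B), it assumes each sample Gaussian has covariance σI, it omits the purge step, and the names `normal_diag` and `em_step` as well as the small constants guarding divisions are assumptions.

```python
import numpy as np

def normal_diag(x, mean, var):
    """2d normal density with diagonal covariance; `var` holds the two variances."""
    d = x - mean
    return np.exp(-0.5 * (d * d / var).sum(-1)) / (2 * np.pi * np.sqrt(var.prod(-1)))

def em_step(P, S, pi, mu, var, sigma=1.0):
    """One E-step and one M-step.

    P: (l, 2) positions, S: (l,) saliency values at those positions,
    pi: (m,) mixing coefficients, mu: (m, 2) means, var: (m, 2) diagonal covariances.
    """
    # Closed-form inner product <s_i, q_k> = S_i * pi_k * N(p_i | mu_k, sigma*I + Sigma_k).
    ip = S[:, None] * pi[None, :] * normal_diag(P[:, None, :], mu[None], var[None] + sigma)
    # E-step: responsibilities; positions with zero saliency get zero responsibility.
    gamma = ip / np.maximum(ip.sum(axis=1, keepdims=True), 1e-12)
    # M-step: effective counts, mixing coefficients, means, diagonal covariances.
    lk = np.maximum(gamma.sum(axis=0), 1e-12)
    pi_new = lk / len(P)
    mu_new = gamma.T @ P / lk[:, None]
    diff2 = (P[:, None, :] - mu_new[None]) ** 2
    var_new = (gamma[:, :, None] * diff2).sum(axis=0) / lk[:, None]
    return gamma, pi_new, mu_new, var_new
```

In a full iteration this step would be followed by the purge step, which removes overlapping components and thereby reduces m.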

Finally, in the purge (P)-step, similarly to NMS, we
process components in descending order of mixing coefficient
and we decide whether to keep a component or not
depending on its overlap with the collection of previously kept
components. Overlap is measured by a generalized
responsibility function similar to (5), and again inner products are
given in closed form [2]. This means that the number of
components m is potentially reduced at each iteration.

Figure 3 shows how regions are formed during EGM
iterations, starting from one small region centered on each
spatial position. We get 4 clean regions on the ground truth
building, as well as 6 regions on background objects, which,
although less salient, cannot be removed based on the
feature saliency alone.

3.5. Region pooling 

Given a rectangular region $R \in \mathcal{R}$ of an image with
feature saliency map $F \in \mathbb{R}^{h \times w}$, we associate to it feature
saliency $f := \mu_F(R) \in \mathbb{R}$, where

$$\mu_F(R) := \frac{1}{|R|} \sum_{p \in R} F_p \qquad (9)$$

is the average of 2d map F over R.

In addition, given the activation map $A \in \mathbb{R}^{h \times w \times c}$ of
the same image, it is standard practice that a descriptor is
obtained by pooling over R, for instance sum [4], weighted
sum [18] or max [3, 40] pooling. We adopt the latter choice
to extract descriptor $z := m_A(R) \in \mathbb{R}^{c}$, where

$$m_A(R) := \max_{p \in R} A_{p \cdot} \qquad (10)$$

is the maximum of 3d tensor A over R along the spatial
dimensions. This has been the basis of fine-tuning in [30, 9].
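
Both region statistics are simple array reductions. The sketch below, given for illustration only, assumes a rectangle is passed as a hypothetical tuple `(y0, x0, y1, x1)` with exclusive upper bounds; the function names are likewise assumptions.

```python
import numpy as np

def region_saliency(F, box):
    """Average of the 2d saliency map F over the rectangle `box`."""
    y0, x0, y1, x1 = box
    return F[y0:y1, x0:x1].mean()

def region_max_pool(A, box):
    """Channel-wise maximum of the (h, w, c) tensor A over the rectangle `box`."""
    y0, x0, y1, x1 = box
    return A[y0:y1, x0:x1].max(axis=(0, 1))
```

With `box` covering the whole map, `region_max_pool` reduces to global max pooling.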

A particular set of regions, uniformly sampled on a grid 
at different scales, is referred to as regional maximum
activation of convolutions (R-MAC) [40]. Global description,
referred to as MAC, is a special case where there is a single 
region R = P. In contrast, we detect a set of regions based 
on saliency maps in this work. 

Finally, we follow [30] in performing supervised whitening
of the descriptors by simultaneous diagonalization [23].
In particular, given vector $z \in \mathbb{R}^{c}$, we ℓ2-normalize, center,
whiten, PCA-project and renormalize to generate the
region descriptor $v := w(z) \in \mathbb{R}^{d}$ for region R. Function
$w : \mathbb{R}^{c} \to \mathbb{R}^{d}$ represents this pipeline entirely.
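
A minimal sketch of such a pipeline is given below. It assumes that the centering vector `mean` and a single matrix `proj` combining whitening and PCA projection have already been learned offline; how they are learned is not shown, and the names `l2n` and `make_whitener` are hypothetical.

```python
import numpy as np

def l2n(x, eps=1e-12):
    """l2-normalize a vector, guarding against division by zero."""
    return x / max(np.linalg.norm(x), eps)

def make_whitener(mean, proj):
    """Return w: R^c -> R^d given a (c,) mean and a (d, c) projection (both assumed given)."""
    def w(z):
        # l2-normalize, center, whiten and project, then renormalize.
        return l2n(proj @ (l2n(z) - mean))
    return w
```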

3.6. Graph construction 

Given an image dataset, we assume here that a set of regions
$\{R_1, \dots, R_n\}$ is detected from the saliency maps
(Section 3.4), a feature saliency vector $f := (f_1, \dots, f_n) \in \mathbb{R}^{n}$
is computed with the corresponding average saliency per
region in (9), and a set of descriptors $V := \{v_1, \dots, v_n\} \subset \mathbb{R}^{d}$
is extracted from the activation maps, whitened and
normalized per region (Section 3.5).

Based on the above information, we construct a k-NN
graph on those regions in order to compute a global
centrality score per region as discussed in Section 3.7, which
enables us to form an object saliency map on a new
image, described in Section 3.8. Approximate techniques for
k-NN graph construction [7] can be used to handle
large-scale databases.

We construct a weighted undirected graph having the
set of descriptors V as vertices. Following [13], the edge
weights are defined according to mutual k-nearest
neighbors (NN) in the descriptor space. In particular, given
descriptors $v, u \in \mathbb{R}^{d}$, we measure their similarity by
$s(v, u) = (v^\top u)^{\beta}$, where exponent $\beta > 0$ is a
parameter. We define the sparse symmetric nonnegative adjacency
matrix $W \in \mathbb{R}^{n \times n}$ with elements $W_{ij}$ being $s(v_i, v_j)$ if
$v_i$, $v_j$ are mutual k-NN in V and zero otherwise.

We define the $n \times n$ degree matrix $D := \operatorname{diag}(W \mathbf{1})$,
where $\mathbf{1} \in \mathbb{R}^{n}$ is the all-ones vector, and the symmetrically
normalized adjacency matrix

$$\mathcal{W} := D^{-1/2} W D^{-1/2}, \qquad (11)$$

with the convention 0/0 = 0. Following [13, 12], we define
the $n \times n$ matrices $L_\alpha := (D - \alpha W)/(1 - \alpha)$ and

$$\mathcal{L}_\alpha := D^{-1/2} L_\alpha D^{-1/2} = (I - \alpha \mathcal{W})/(1 - \alpha), \qquad (12)$$

where $\alpha \in [0, 1)$. Both are positive-definite [13, 1 ].



Figure 4. Computing the object saliency map S of an image from 
Instre dataset (top), as defined in (14). For each patch, its
neighbors in the graph (right) are found. Common patterns with high
centrality in green outline, outliers with low centrality in red. S 
(bottom) then focuses on patches similar to common patterns and 
combines with feature saliency F (left). 


3.7. Graph centrality 

With the above definitions in place, the objective is to
compute a vector $g \in \mathbb{R}^{n}$ where each element $g_i$ represents
the significance of vertex $v_i$ in the graph, for $i \in [n]$. We
define this centrality vector as the solution $g^* \in \mathbb{R}^{n}$ of the
linear system

$$\mathcal{L}_\alpha g = \mathbf{1}. \qquad (13)$$

As in [13], we solve this system by the conjugate gradient
(CG) method [24]. Any method would be equally appropriate
because this is computed just once offline.
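
The graph construction of Section 3.6 and the centrality computation can be sketched with dense NumPy arrays as follows. This is an illustrative sketch for small n, under stated assumptions: the right-hand side of the linear system is taken to be the all-ones vector, negative inner products are clipped to zero before raising to β so that weights stay nonnegative, the default values of `beta` and `alpha` are placeholders, and a plain textbook conjugate gradient loop stands in for whatever solver is used at scale.

```python
import numpy as np

def mutual_knn_adjacency(V, k, beta=3.0):
    """Dense adjacency over rows of V (n, d); nonzero only for mutual k-NN pairs."""
    sim = np.clip(V @ V.T, 0.0, None) ** beta   # clipping negatives is an assumption
    np.fill_diagonal(sim, 0.0)                  # no self-loops
    nn = np.argsort(-sim, axis=1)[:, :k]        # indices of the k most similar vertices
    is_nn = np.zeros(sim.shape, dtype=bool)
    is_nn[np.arange(len(V))[:, None], nn] = True
    return sim * (is_nn & is_nn.T)              # keep mutual neighbors only

def katz_centrality(W, alpha=0.9, iters=200, tol=1e-12):
    """Solve ((I - alpha * Wn) / (1 - alpha)) g = 1 by conjugate gradients."""
    d = W.sum(axis=1)
    inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.maximum(d, 1e-12)), 0.0)  # 0/0 = 0
    Wn = inv_sqrt[:, None] * W * inv_sqrt[None, :]
    L = (np.eye(len(W)) - alpha * Wn) / (1.0 - alpha)
    b = np.ones(len(W))
    g = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Lp = L @ p
        step = rs / (p @ Lp)
        g += step * p
        r -= step * Lp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return g
```

An isolated vertex receives the minimum possible score 1 − α in this formulation, while vertices in dense neighborhoods of mutually similar regions score higher.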

The solution g* is a graph centrality measure [22], and
in particular, Katz centrality [1 ]. Centrality is a global
measure of significance of vertices in a graph, and
PageRank [28] is perhaps the best-known example. In fact, Katz
centrality was introduced as such a global measure before
being adapted by boundary condition y to measure relevance
to individual vertices by Hubbell [11]. This work has a long
history before being rediscovered e.g. by [28, 46], as
summarized in the study of spectral ranking [42].

3.8. Saliency map construction 

Given the region descriptor set V, the region saliency
vector f and the associated centrality vector g* of an
entire dataset, the problem is to construct a new saliency map
$S \in \mathbb{R}^{h \times w}$ for an image in the dataset. The image is
represented by its activation map $A \in \mathbb{R}^{h \times w \times c}$. Since this
saliency is based on regions or patterns appearing frequently
in the dataset, which are commonly associated with repeating
objects, we call it object saliency (OS).

We compute S by a sliding window iteration over each
position $p \in P$. The saliency value $S_p$ at p is found as
a linear combination of the centrality values of the nearest
neighbors in V of a patch centered at p. In particular, we
consider a square patch $R_p$ of side a centered at p. We
compute the vector $u_p := w(m_A(R_p)) \in \mathbb{R}^{d}$ by max-pooling
over $R_p$, whitening and normalizing as discussed in
Section 3.5. If $N_p$ is the set of indices of the k-NN of $u_p$ in V,
we compute $S_p$ as

$$S_p := F_p^{\theta} \sum_{i \in N_p} s(v_i, u_p) \, f_i^{\delta} \, g_i. \qquad (14)$$

That is, each neighboring region descriptor is weighted
by its similarity to patch descriptor $u_p$, its feature saliency
$f_i$ and its centrality $g_i$, while the entire sum is scaled by
the feature saliency $F_p$ at the current position p of the
image being considered. Exponents θ and δ control the
relative importance of feature saliency of the current image and
neighbors, respectively, compared to centrality. The object
saliency computation is illustrated in Figure 4. Looking at
the input image and its feature saliency map F alone, it is
not evident which is the object of interest and which is
clutter. This is only found by discovering other instances of the
same object in the dataset, as represented by the graph.
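
The sliding-window computation can be sketched as below. This is a brute-force illustration under stated assumptions, not the implementation used in our experiments: nearest neighbors are found by exhaustive search, patches are truncated at image borders, negative similarities are clipped to zero, the whitening pipeline is passed in as a callable `w`, and all names and default values (`side`, `k`, `beta`, `exp_image`, `exp_neighbors`) are hypothetical. The two exponents play the role of the exponents on the feature saliency of the current image and of the neighbors, respectively.

```python
import numpy as np

def object_saliency(A, F, V, f, g, w, side=3, k=5, beta=3.0,
                    exp_image=2.0, exp_neighbors=3.0):
    """Dense object saliency map for one image.

    A: (h, w, c) activations, F: (h, w) feature saliency of this image,
    V: (n, d) region descriptors, f: (n,) region saliencies, g: (n,) centralities,
    w: callable mapping a pooled (c,) vector to a normalized (d,) descriptor.
    """
    h, wd, _ = A.shape
    S = np.zeros((h, wd))
    r = side // 2
    for y in range(h):
        for x in range(wd):
            # Max-pool a square patch centered at (y, x), truncated at the borders.
            patch = A[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
            u = w(patch.max(axis=(0, 1)))
            # Similarity to every region descriptor; keep the k most similar.
            sim = np.clip(V @ u, 0.0, None) ** beta
            nn = np.argsort(-sim)[:k]
            # Weighted sum of neighbor centralities, scaled by local feature saliency.
            S[y, x] = F[y, x] ** exp_image * np.sum(sim[nn] * f[nn] ** exp_neighbors * g[nn])
    return S
```

Positions with zero feature saliency are guaranteed to have zero object saliency, since the whole sum is scaled by the local feature saliency.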

3.9. Representation 

The object saliency map S highlights patterns that
appear frequently in the dataset, with the background clutter
removed. It is only natural then to apply the same method
described in Section 3.4 to this map in order to detect a
small number of regions per image. Unlike the regions
detected from the feature saliency map F, these new regions
are more likely to appear in a new image. For the purpose
of evaluation, we investigate both saliency maps.

For each region R detected from a saliency map (F
or S) in a dataset image with activation map A, we
apply max pooling and ℓ2-normalization. All descriptors are
then summed and the resulting descriptor is whitened with
$w : \mathbb{R}^{c} \to \mathbb{R}^{d}$ as described in Section 3.5. The difference
here is that we apply whitening on the aggregated vector
and not separately per region. This is the same
representation as R-MAC evaluated in [30] and both yield a global
image representation in $\mathbb{R}^{d}$, but here the regions are detected
in the saliency map rather than being uniformly distributed.
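
The aggregation just described amounts to the following sketch. The rectangle format `(y0, x0, y1, x1)` with exclusive upper bounds, the function name, and the callable `w` standing for the whitening pipeline are assumptions for illustration.

```python
import numpy as np

def global_descriptor(A, boxes, w):
    """Max-pool each region, l2-normalize, sum, then whiten the aggregate with w."""
    agg = np.zeros(A.shape[2])
    for y0, x0, y1, x1 in boxes:
        z = A[y0:y1, x0:x1].max(axis=(0, 1))        # regional max pooling
        agg += z / max(np.linalg.norm(z), 1e-12)     # l2-normalize before summing
    return w(agg)                                    # whitening on the aggregated vector
```

The only difference from uniform sampling is where `boxes` comes from: here they are the regions detected on a saliency map.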

Pooling based on saliency is in fact the idea explored
in CroW [18], but here we follow the nonlinear two-level
pooling of R-MAC (max followed by sum) rather than the
one-level sum of CroW. This is more powerful and has also
been the basis of fine-tuning in [ ].

4. Experiments 

We apply the proposed representation to image retrieval.
In particular, we have two variants of our method that both
use the region detection described in Section 3.4. The
saliency map on which the detection is performed is
different in each case. FS.EGM uses the feature saliency map
described in Section 3.3, and OS.EGM uses the object saliency
map described in Section 3.8. The former is image specific,
while the latter is both image and database specific.


4.1. Experimental setup 

Test sets. We evaluate on Oxford Buildings [29] and the
more recently introduced Instre [43] dataset. Instre
contains around 27k images of small objects in cluttered scenes,
where objects appear with different variations, such as
rotation, occlusion and scale changes, making it a challenging
case. We use the evaluation protocol introduced in [13] for
Instre. We add 100k distractors from Flickr [29] to
Oxford5k to perform experiments at larger scale. We refer to it
as Oxford105k. Search performance in all datasets is
measured with mAP.

Image Representation. We represent each image by a global
image representation as described in Section 3.9. This
reduces image similarity to cosine similarity, which is common
practice [40]. Feature extraction is performed with the VGG
network [3 ] that is fine-tuned specifically for image
retrieval [3 ]. Supervised whitening [30] is used for
post-processing. The same network is additionally used to
compare against two baselines. The first is the MAC global
descriptor, which is obtained by global max pooling and is the
descriptor that the network is directly optimized for [30]. The
second is the baseline approach (Uniform), which refers to regional max
pooling for regions that are uniformly sampled at 3 scales,
as in R-MAC [40]. Our variants differ in that regions
are detected from salient and repeating objects, while
aggregation and whitening are identical. Detection is applied
to dataset images only, while we use the provided bounding
boxes on the query side.

Implementation Details. To simplify region detection,
each saliency map is masked with threshold r and
element-wise raised to exponent p before detection, which
removes the weakest regions and increases the contrast
between foreground and background objects. We set p = 1,
r = 0.2 and scale parameter σ = 1 before any parameter
tuning is performed. We determine OS parameters θ, δ
in (14) by visual inspection of OS and set θ = 2, δ = 3
throughout our experiments. We perform our experiments
on a 16-core Intel Xeon 2.00GHz CPU. It takes 36s to create
the graph on Instre, while centrality computation takes a
negligible amount of time. It takes 0.02s for FS computation
and detection per image, and 0.23s in the case of OS.
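
The masking and contrast step admits a one-line sketch. The reading taken here, that values below the threshold are set to zero and the remaining values are raised to the exponent, is an assumption, as are the function and parameter names.

```python
import numpy as np

def sharpen_saliency(S, threshold=0.2, exponent=1.0):
    """Zero out values below `threshold`, then raise the map to `exponent` element-wise."""
    return np.where(S >= threshold, S, 0.0) ** exponent
```

For a map with values in [0, 1], a larger exponent suppresses moderately salient positions more strongly relative to the peaks.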

4.2. Parameter tuning 

In this section, we show the impact of FS.EGM and
OS.EGM detection parameters on the retrieval performance.
We tune the parameters on Oxford5k when using
diffusion [13]. The remaining experiments evaluate the proposed
representation with the chosen parameters on Instre and
Oxford105k as well.

Figure 5. mAP on Oxford5k versus saliency exponent p for
FS.EGM and OS.EGM.

Figure 6. mAP on Oxford5k versus threshold r for FS.EGM and
OS.EGM.

Figure 7. mAP on Oxford5k versus EGM scale parameter σ for
FS.EGM and OS.EGM.

Feature saliency detection is evaluated first by FS.EGM,
while we do not compute object saliency and OS.EGM yet.
Figure 5 shows the effect of p, which controls the contrast
of the saliency map. We observe that large p is needed
to remove as much clutter as possible from the noisy FS
activations. We set p = 5 for the rest of our experiments.
Figure 6 shows the effect of threshold r, which is another
selectivity parameter. We set r = 0.4. Scale σ is used
during EGM sampling as explained in Section 3.4. Its impact
on performance is shown in Figure 7. Setting σ = 2.5
results in good performance and regions that are large enough
for FS.EGM.

Object saliency detection is then evaluated once the
feature saliency parameters are fixed, and EGM detection is
applied on the new saliency map. We observe that OS
behaves quite differently from FS, because foreground objects
are much cleaner. The impact of parameters p and σ is
shown in Figures 5 and 7 respectively. It is remarkable that
a much lower exponent is needed in this case. We choose
p = 2 and σ = 2. Finally, we fix r = 0 for OS, as the
saliency maps obtained with OS are exactly zero at
background regions. The effect is shown in Figure 6.

4.3. Evaluation of saliency maps 

[Figure 8 axis label: Saliency precision]

Figure 8. Histogram of saliency precision for FS and OS maps
measured on all images of Instre.

[Figure 9 column titles: image, FS.EGM, OS.EGM]

Figure 9. Examples of images from Oxford5k (first 2 rows) and
Instre (last 3 rows) datasets, along with smoothed FS and OS maps
superimposed on the images and regions detected by EGM, in red.

We exploit the fact that the Instre dataset comes with
bounding box annotation for all database images. We use
the ground truth information to quantitatively evaluate the
saliency maps. We define precision as the sum of saliency
over ground truth regions, normalized by the sum over the
entire image, and we measure it for FS and OS as shown
in Figure 8. High precision means that a saliency map is
well aligned to the ground truth bounding boxes. Given
that these bounding boxes are not used anywhere, the
improvement that OS offers is impressive. Visual examples
for saliency maps and detections for FS.EGM and OS.EGM
are shown in Figure 9. In all cases, OS is cleaner and
focuses on objects that FS cannot discriminate.
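
The precision measure defined in this section is straightforward to compute. The sketch below assumes ground truth boxes are given as hypothetical tuples `(y0, x0, y1, x1)` with exclusive upper bounds in saliency map coordinates, and returns 0 for an all-zero map by convention.

```python
import numpy as np

def saliency_precision(S, boxes):
    """Sum of saliency inside ground truth boxes divided by the sum over the whole map."""
    mask = np.zeros(S.shape, dtype=bool)
    for y0, x0, y1, x1 in boxes:
        mask[y0:y1, x0:x1] = True        # union of boxes, so overlaps are not double-counted
    total = S.sum()
    return S[mask].sum() / total if total > 0 else 0.0
```

A uniform map yields a precision equal to the fraction of the image covered by the boxes, which serves as a natural reference point.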

4.4. Comparison to other methods 

We compare our methods to the standard practice of uniform 
region sampling (Uniform), as in R-MAC, and to global 
max pooling (MAC). We additionally propose a variant of 
OS.EGM, where further uniform region sampling at 2 scales 
is performed within each detected region. We refer to this 
as OS.egm-A. All methods are tested with k-NN search 
and global diffusion [13], which is a method for query 
expansion or manifold search and is known to significantly 
improve performance. Results are given in Table 1. 

Method         QE   Instre   Oxford   Oxford105k 
MAC            -    48.5     79.7     73.9 
Uniform [ 3]   -    47.7     77.7     70.1 
FS.EGM *       -    48.4     77.5     70.2 
OS.EGM *       -    50.1     79.6     71.8 
OS.egm-A*      -    53.7     79.8     71.4 
MAC            ✓    71.8     87.4     86.0 
Uniform [ 3]   ✓    70.3     85.7     82.7 
FS.EGM *       ✓    71.2     89.8     87.9 
OS.EGM *       ✓    72.7     90.4     88.0 
OS.egm-A*      ✓    75.4     90.1     84.3 

Table 1. mAP comparison of our methods marked with * against 
baselines on all tested datasets. QE refers to query expansion by 
diffusion [13]. 
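
For orientation, the difference between MAC and a descriptor pooled 
over a set of regions can be sketched as follows. This is a simplified 
illustration under stated assumptions: the feature map is a plain 
C x H x W nested list, boxes are (x0, y0, x1, y1) with exclusive upper 
bounds and must be non-empty, and regional vectors are aggregated by 
summing L2-normalized max-pooled vectors. Any whitening step is 
omitted, and the function names are hypothetical. 

```python
import math


def mac(fmap, box=None):
    """Max-pool each channel of a C x H x W feature map over a box.

    With box=None the whole map is pooled, which gives the MAC descriptor.
    """
    height, width = len(fmap[0]), len(fmap[0][0])
    x0, y0, x1, y1 = box if box is not None else (0, 0, width, height)
    return [
        max(channel[y][x] for y in range(y0, y1) for x in range(x0, x1))
        for channel in fmap
    ]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean norm (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def pooled_descriptor(fmap, boxes):
    """Sum of L2-normalized regional max-pooled vectors, normalized again."""
    acc = [0.0] * len(fmap)
    for box in boxes:
        for i, v in enumerate(l2_normalize(mac(fmap, box))):
            acc[i] += v
    return l2_normalize(acc)
```

In this sketch, Uniform corresponds to passing a fixed grid of boxes, 
while the EGM variants would pass the detected regions instead. The 
result is a single vector per image in both cases. 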

FS.EGM improves performance compared to uniform sampling 
in most settings by focusing on salient objects. However, salient 
objects are not necessarily relevant for the particular dataset. 
OS.EGM captures this relevance and boosts search 
performance, especially on Instre. On all datasets, MAC is 
better than uniform sampling (R-MAC). This is because the 
network used [3( ] is directly fine-tuned to optimize MAC. 
However, when using diffusion, OS.EGM outperforms it on all 
datasets. This can be explained by the fact that diffusion 
boosts any items that are similar to the top-ranking ones 
according to the original similarity [13], so it is essential 
that these items are reliable. A global descriptor is affected 
by clutter in general. By contrast, our representation is 
global yet clutter-free. Our improvements are larger on 
Instre, which is more challenging due to small objects and 
severe background clutter. This is exactly where our 
detection is essential. Most Instre images are also quite different 
from the building images on which the network is fine-tuned. 
This is probably why our OS representations outperform 
MAC even without diffusion on this dataset. 

Several previous approaches deal with region detection 
or saliency masks but are not directly comparable, so they 
are not included in Table 1. Nevertheless, we outperform 
their reported results. Salvador et 
al. [32] use the off-the-shelf VGG and fine-tune RPN on the 
test set. Without using query expansion, they obtain 71.0 on 
Oxford5k. Similarly, Jimenez et al. [16] learn class weights 
and apply them on the activation maps of off-the-shelf VGG 
and achieve 73.6 on Oxford5k. Song et al. [37] train on 
different datasets and achieve 78.3 on Oxford5k. The results 
obtained by learning a saliency mask are not comparable 
since spatial verification with local features is always 
applied in the end [25]. Finally, Zheng et al. [44] achieve 83.4 
with regional representation on Oxford5k. They employ 
both CNN and local features, while we rely only on CNN 
features and a much more compact representation. Moreover, no 
work other than [13] evaluates on Instre, which is rather 
challenging due to small objects. 

Figure 10. mAP comparison of our global OS.EGM (*) to R-Match 
with uniformly sampled regional descriptors, with and without 
diffusion on Oxford5k. Text labels refer to query time. 

Region cross-matching methods [31] represent an image 
with multiple vectors, sacrificing memory footprint and 
complexity for accuracy. In particular, the memory is linear 
in the number of regions, while the complexity is quadratic. 
We compare our global representation with region 
cross-matching (R-Match) and regional diffusion [13] in 
Figure 10. Different numbers of regions are obtained by GMM 
reduction, exactly as in [12]. 

Compared to regional descriptors, we require about 4 
times less memory to achieve the same performance. The 
runtime complexity gain is on the order of 4^2 = 16, which holds 
for the case of R-Match and also for the first part of 
diffusion, where Euclidean nearest neighbors are found. The 
diffusion complexity is O(m), where m is the number of 
non-zero entries of the graph. We found that m is 3.7 times 
smaller in our case, and our measurements of actual query 
timings agree with this ratio. 
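
The scaling behind these ratios can be made concrete with a rough cost 
model. The function name, the descriptor dimension, the storage size 
per value, and the region counts used in the example are illustrative 
assumptions, not measured settings. Only the linear and quadratic 
dependence on the number of regions is taken from the text. 

```python
def regional_costs(n_images, regions_per_image, dim=512, bytes_per_value=4):
    """Rough cost model for searching a database of regional descriptors.

    dim and bytes_per_value are placeholder values, not measured settings.
    Returns (memory in bytes, number of region-pair comparisons per query).
    """
    # Memory is linear in the number of vectors stored per image.
    memory_bytes = n_images * regions_per_image * dim * bytes_per_value
    # Cross-matching compares every query region with every region of
    # every database image, hence the quadratic term.
    pair_comparisons = n_images * regions_per_image ** 2
    return memory_bytes, pair_comparisons


# Hypothetical example: 4 vectors per image versus a single vector.
mem_regional, cmp_regional = regional_costs(1000, 4)
mem_single, cmp_single = regional_costs(1000, 1)
memory_ratio = mem_regional / mem_single
comparison_ratio = cmp_regional / cmp_single
```

With a factor of 4 in the number of vectors per image, the model gives 
a memory ratio of 4 and a comparison ratio of 16. 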

5. Conclusions 

We propose a region detection approach that is 
dataset-specific but requires no supervision. It captures not only 
salient objects by considering each image individually but 
also frequently appearing ones by considering the dataset 
as a whole. As a result, we avoid separate indexing of 
regional descriptors and construct a global descriptor by 
pooling over data-dependent regions, which performs well under 
background clutter and severe occlusions. We demonstrate 
that this approach is effective in particular object retrieval 
where background clutter is a common problem. 